Reading: <a href="http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf"> Generative and Discriminative Classifiers </a> by Tom Mitchell.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.stats import norm
import seaborn as sns
sns.set_theme()
x1 = np.linspace(-10,10,1000)
x2 = np.linspace(-10,10,1000)
Generative classifiers:
Learn $f: \mathcal{X} \rightarrow \{0,1\}$ by modeling $P(X|Y)$ and $P(Y)$ and applying Bayes rule.
Discriminative classifiers:
Learn $f: \mathcal{X} \rightarrow \{0,1\}$ directly, where
$P(Y=1|X={\bf x})$ is modeled as:
\begin{eqnarray} P(Y=1|X={\bf x}) = \frac{1}{1+\exp\big(- (w_0+\sum_{j=1}^d w_j x_j)\big)} = \frac{\exp(w_0+\sum_{j=1}^d w_j x_j)}{\exp(w_0+\sum_{j=1}^d w_j x_j)+1} \end{eqnarray}It uses the logistic (or sigmoid) function:
\begin{eqnarray} \frac{1}{1+\exp{(-z)}} \end{eqnarray}(Note: these parameters are not tied to the naive Bayes parameters from the previous slide. See comparisons at the end of the lecture.)
z = np.linspace(-10,10,1000)
plt.plot(z,1/(1+np.exp(-z)))
plt.xlabel(r'$Z = w_0+\sum_i w_i X_i$',fontsize=16)
plt.ylabel(r'logistic$(Z) = P(Y=1|X)$',fontsize=16)
Asking: is $P(Y=1|X)>P(Y=0|X)$ is the same as asking: $$\ln \frac{P(Y=1|X)}{P(Y=0|X)} >0?$$
i.e. is $$w_0+\sum_{j=1}^d w_j x_j ~ ~ >0?$$
This is a linear decision boundary!
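A quick numeric check of this equivalence (the weights $w_0$, ${\bf w}$ and the point ${\bf x}$ here are hypothetical, chosen just for illustration): the log-odds under the logistic model recover exactly the linear score $w_0+\sum_j w_j x_j$.

```python
import numpy as np

# Hypothetical weights for illustration
w0, w = -0.5, np.array([0.8, -1.0])

def p_y1(x):
    """P(Y=1|X=x) under the logistic model."""
    z = w0 + w.dot(x)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([2.0, 1.0])
p = p_y1(x)
# The log-odds recover the linear score w0 + w.x, so the
# decision boundary p = 1/2 is exactly w0 + w.x = 0
log_odds = np.log(p / (1 - p))
print(np.isclose(log_odds, w0 + w.dot(x)))  # True
```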
from scipy.stats import multivariate_normal
# similar to previous example
mu_1_1 = -4; sigma_1_1 = 2;mu_2_1 = 4; sigma_2_1 = 2
mu_1_0 = 4; sigma_1_0 = 2;mu_2_0 = -4; sigma_2_0 = 2
cov_positive = np.array([[sigma_1_1**2,3], [3,sigma_2_1**2]] )
cov_negative = np.array([[sigma_1_0**2,3], [3,sigma_2_0**2]] )
# Sample data from these distributions
X_positive = multivariate_normal.rvs(mean=[mu_1_1,mu_2_1], cov=cov_positive, size = (20))
X_negative = multivariate_normal.rvs(mean=[mu_1_0,mu_2_0], cov=cov_negative, size = (20))
plt.figure(figsize=(6,6))
plt.scatter(X_positive[:, 0], X_positive[:, 1],facecolors='r', edgecolors='w')
plt.scatter(X_negative[:, 0], X_negative[:, 1],facecolors='b', edgecolors='w')
# hand picked line
plt.plot(x1, x1*0.8+0.5)
from labellines import labelLine
labelLine(plt.gca().get_lines()[-1],0.6,label=r'$w_0+\sum_j w_j x_j = 0$',fontsize=16)
plt.axis([-10,10,-10,10]), plt.axis('equal')
plt.xlabel(r'$x_1$',fontsize=20); plt.ylabel(r'$x_2$',fontsize=20)
plt.title('Data space',fontsize=20);
${\bf w}^\top{\bf x}= 0, ~ ~ ~ ~ ~ ~ ~ P(Y=1|X) = \frac{1}{2}$
${\bf w}^\top{\bf x} \rightarrow \infty, ~ ~ ~ ~ ~ ~ ~ P(Y=1|X) \rightarrow 1$
${\bf w}^\top{\bf x} \rightarrow -\infty, ~ ~ ~ ~ ~ ~ ~ P(Y=1|X) \rightarrow 0$
Let's focus on binary classification $$ P(Y=1|X) = \frac{\exp({\bf w}^\top{\bf x})}{\exp({\bf w}^\top{\bf x})+1} $$ $$ P(Y=0|X) = \frac{1}{\exp({\bf w}^\top{\bf x})+1} $$
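These two expressions form a valid distribution over $\{0,1\}$ and agree with the sigmoid form from earlier. A quick check (the score $z$ is a hypothetical value of ${\bf w}^\top{\bf x}$):

```python
import numpy as np

z = 1.3  # a hypothetical score w^T x, chosen just for illustration
p1 = np.exp(z) / (np.exp(z) + 1)   # P(Y=1|X)
p0 = 1.0 / (np.exp(z) + 1)         # P(Y=0|X)

print(np.isclose(p1 + p0, 1.0))          # True: valid distribution over {0,1}
print(np.isclose(p1, 1/(1+np.exp(-z))))  # True: same as the logistic/sigmoid form
```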
How to learn ${\bf w} = [w_0, w_1, \ldots, w_d]$?
Training data: $\{({\bf x}^{(i)},y^{(i)})\}_{i=1}^n$, with $~{\bf x}^{(i)}=\left(x_1^{(i)},x_2^{(i)},...,x_d^{(i)} \right)$
Maximum Likelihood Estimation:
$$\hat {\bf w}_{\text{MLE}} = \underset{{\bf w}}{\operatorname{argmax}} \prod_{i=1}^n P(X = {\bf x}^{(i)},Y = y^{(i)}| {\bf w})$$
Problem: We don’t have a model for $P(X)$ or $P(X|Y)$ – only for $P(Y|X)$
Discriminative philosophy – Don’t waste effort learning $P(X)$, focus on $P(Y|X)$
Maximum (Conditional) Likelihood Estimation:
$$\hat {\bf w}_{\text{MCLE}} = \underset{{\bf w}}{\operatorname{argmax}} \prod_{i=1}^n P(Y = y^{(i)}|X = {\bf x}^{(i)},{\bf w})$$
$l({\bf w})$, the conditional log likelihood, is concave, so we can maximize it via gradient ascent
Gradient: $$\nabla_{{\bf w}} l({\bf w}) = \left[ \frac{\partial l({\bf w}) }{\partial w_0},...,\frac{\partial l({\bf w})}{\partial w_d} \right] $$
Update rule for gradient ascent, with learning rate $\eta>0$: $$\Delta{\bf w} = \eta\nabla_{{\bf w}} l({\bf w}) $$
Equivalently, gradient descent on $l'({\bf w}) = -l({\bf w})$:
$$ w_i^{(t+1)} = w_i^{(t)} - \eta \frac{\partial l'({\bf w})}{\partial w_i}\Big|_{{\bf w}^{(t)}} $$(maximizing $l({\bf w})$ is the same as minimizing $l'({\bf w}) = -l({\bf w})$)
Review, let's start with a simple function:
$$f(w) = 0.2(w - 2) ^2 + 1$$We know that this function is convex (2nd derivative exists and is positive).
f = lambda w: 0.2*(w-2)**2+1
dfdw = lambda w: 0.4*w - 0.8
plt.figure(figsize=(8,6))
w = np.linspace(-4,8,1000)
plt.plot(w, f(w), linewidth=3 )
plt.xlabel(r'$w$')
plt.ylabel(r'$f(w)$')
plt.title(r'Minimize $f(w)$, start with a random point $w_0$',fontsize = 24);
w_0 = 6
plt.plot(w_0, f(w_0), "o",markersize=10)
def draw_vector_2D(ax, x, y, lenx, leny, name, color='k'):
    ax.quiver(x, y, lenx, leny, color=color, angles='xy', scale_units='xy', scale=1)
    ax.text(x+lenx/2, y+leny/2+0.5, name, fontsize=16, color=color)
draw_vector_2D(plt, w_0, 0, dfdw(w_0),0, r'$\nabla f(w_0)$','k')
plt.figure(figsize=(8,6))
plt.plot(w, f(w), linewidth=3 )
plt.xlabel(r'$w$')
plt.ylabel(r'$f(w)$')
plt.title(r'Minimize $f(w)$, start with a random point $w_0$, step size $\eta=0.5$',fontsize = 24);
w_0 = 6
plt.plot(w_0, f(w_0), "o",markersize=10)
draw_vector_2D(plt, w_0, 0, dfdw(w_0),0, r' ','k')
eta=0.5
draw_vector_2D(plt, w_0, 0, - dfdw(w_0)*eta,0, r'$-\eta\nabla f(w_0)$','r')
plt.figure(figsize=(8,6))
plt.plot(w, f(w), linewidth=3 )
plt.xlabel(r'$w$')
plt.ylabel(r'$f(w)$')
w_1 = w_0 - dfdw(w_0)*eta
plt.title(r'Set $w_{t+1} = w_{t} - \eta \nabla f(w_t)$ and repeat', fontsize = 24);
plt.plot(w_0, f(w_0), "o",markersize=10)
plt.plot(w_1, f(w_1), "o",markersize=10)
draw_vector_2D(plt, w_1, 0, - dfdw(w_1)*eta,0, r'$-\eta\nabla f(w_1)$','g')
plt.plot(w, f(w), linewidth=3 )
plt.xlabel(r'$w$')
plt.ylabel(r'$f(w)$')
# w_1 = w_0 - dfdw(w_0)*eta
w_t = np.zeros(20)
w_t[0] = 7 # w_0
eta = 0.1
for i in range(1, 20):
    w_t[i] = w_t[i-1] - eta * dfdw(w_t[i-1])
plt.title(r'Gradient descent with $\eta={}$'.format(eta), fontsize = 20);
plt.plot(w_t, f(w_t), "o-",markersize=10)
x = np.linspace(-1,2,100);y = np.linspace(-1,2,100); X,Y = np.meshgrid(x, y)
f_XY = np.cos(X)+np.sin(Y)
plt.figure(figsize=(7,6))
cs = plt.contourf(X, Y, f_XY,20,cmap='RdBu_r',vmin=-1,vmax=1,alpha=0.6);plt.colorbar()
contours = plt.contour(cs, colors='k')
plt.xlabel('x');plt.ylabel('y')
plt.title('color indicates magnitude of function \n & gradient is perpendicular to level set', fontsize=24)
draw_vector_2D(plt, 1.45,0.5,-np.sin(1.45),np.cos(0.5),'','k')
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
%config InlineBackend.print_figure_kwargs = {'bbox_inches':None}
fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(projection='3d')
Z = np.cos(X)+np.sin(Y)
# Plot the surface.
surf = ax.plot_surface(X, Y, Z, cmap='gray',
linewidth=0, antialiased=False, rcount=200, ccount=200)
ax.set_xlabel('x');ax.set_ylabel('y');ax.set_zlabel('f(x,y)');
ax.set_title('Alternative visualization, 3rd dim is magnitude', fontsize=16);
Simple simulated example
# Previous example
mu_1_1 = -5; sigma_1_1 = 2;mu_2_1 = 5; sigma_2_1 = 2
mu_1_0 = 5; sigma_1_0 = 2; mu_2_0 = -5; sigma_2_0 = 2
cov_positive = np.array([[sigma_1_1**2,3], [3,sigma_2_1**2]] )
cov_negative = np.array([[sigma_1_0**2,3], [3,sigma_2_0**2]] )
# Sample data from these distributions
X_positive = multivariate_normal.rvs(mean=[mu_1_1,mu_2_1], cov=cov_positive, size = (20))
X_negative = multivariate_normal.rvs(mean=[mu_1_0,mu_2_0], cov=cov_negative, size = (20))
X = np.vstack([X_positive, X_negative])
Y = np.vstack([np.ones((X_positive.shape[0],1)),np.zeros((X_negative.shape[0],1))])
plt.figure(figsize=(8,8))
plt.scatter(X_positive[:, 0], X_positive[:, 1],facecolors='r', edgecolors='w')
plt.scatter(X_negative[:, 0], X_negative[:, 1],facecolors='b', edgecolors='w')
plt.plot(x1, x1*0.8)
plt.axis([-10,10,-10,10]),plt.axis('equal')
plt.xlabel(r'$x_1$',fontsize=20)
plt.ylabel(r'$x_2$',fontsize=20)
plt.title('Data space',fontsize=20);
w1x = np.linspace(-50,40,100)
w2x = np.linspace(-50,40,100)
W1,W2 = np.meshgrid(w1x, w2x)
## omitting w_0 just for illustration
def loglikelihood(w1, w2):
    w = np.array([[w1],[w2]]) # make w a column vector
    loglik = np.sum(Y*X.dot(w) - np.log(1 + np.exp(X.dot(w))))
    return loglik
L_w = np.vectorize(loglikelihood)(*np.meshgrid(w1x, w2x))
plt.figure(figsize=(8,6))
cs = plt.contourf(W1, W2, L_w,20,cmap='Greens_r',vmin=np.min(L_w),vmax=0,alpha=0.6);
plt.colorbar()
contours = plt.contour(cs, colors='k')
plt.xlabel(r'$w_1$',fontsize=20)
plt.ylabel(r'$w_2$',fontsize=20)
plt.title('Parameter space',fontsize=20);
def gradient_likelihood(w1, w2, X, Y):
    w = np.array([[w1],[w2]])
    P_Y_1 = np.exp(X.dot(w))/(1 + np.exp(X.dot(w)))
    gw1 = X[:,0:1].T.dot(Y - P_Y_1)
    gw2 = X[:,1:2].T.dot(Y - P_Y_1)
    return gw1, gw2
plt.figure(figsize=(9,7))
cs = plt.contourf(W1, W2, L_w,20,cmap='Greens_r',vmin=np.min(L_w),
vmax=0,alpha=0.6); plt.colorbar()
contours = plt.contour(cs, colors='k')
plt.xlabel(r'$w_1$',fontsize=20);plt.ylabel(r'$w_2$',fontsize=20)
plt.title('Parameter space',fontsize=20);
w1 = -5; w2 = -5
gw1, gw2 = gradient_likelihood(w1,w2,X, Y)
draw_vector_2D(plt, w1,w2,gw1/10,gw2/10, r'$\nabla L(w)/10$','k');
Iterate until convergence (until change $< \epsilon$)
for $j = 0...d$:
\begin{eqnarray} w_j^{(t+1)} = w_j^{(t)} + \eta \sum_{i=1}^n x_j^{(i)} \left [ y^{(i)} - \hat P(Y = 1|X = {\bf x}^{(i)}, {\bf w}^{(t)}) \right] \end{eqnarray}What is the $\left[ y^{(i)} - \hat P(Y = 1|X = {\bf x}^{(i)}, {\bf w}^{(t)}) \right]$ term doing?
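A minimal batch gradient-ascent sketch of this update on synthetic two-blob data (the blob means, learning rate, and iteration count are illustrative choices, not from the lecture):

```python
import numpy as np
from scipy.special import expit  # numerically stable logistic function

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs (hypothetical data for illustration)
X = np.hstack([np.ones((100, 1)),                       # prepend 1 for w_0
               np.vstack([rng.normal( 3, 1, size=(50, 2)),
                          rng.normal(-3, 1, size=(50, 2))])])
y = np.hstack([np.ones(50), np.zeros(50)])

w = np.zeros(3)
eta = 0.05
for t in range(200):
    p = expit(X.dot(w))          # P-hat(Y=1 | x^(i), w^(t)) for every i
    w += eta * X.T.dot(y - p)    # w_j += eta * sum_i x_j^(i) (y^(i) - p^(i))

acc = np.mean((expit(X.dot(w)) > 0.5) == y)  # near 1.0 on separable blobs
```

The bracketed term is the prediction error: confidently correct points contribute almost nothing, while misclassified points pull $w_j$ in the direction of their $x_j^{(i)}$.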
Typically, the loss / function to minimize is a mean over the loss for each individual point: $ L({\bf w}) = \frac{1}{n}\sum_{i=1}^n L_i({\bf w})$.
We use $\nabla L_i({\bf w})$ instead of $\nabla L({\bf w})$. Since we sample the points uniformly, they all have equal probability, and the expected value of $\nabla L_i({\bf w})$ is $\nabla L({\bf w})$ itself:
$\nabla L_i({\bf w})$ is faster to compute
Algorithm:
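A sketch of the stochastic update (one point per step) in the same logistic-regression setting; data and hyperparameters are again hypothetical:

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(1)
# Hypothetical two-blob data, with a leading 1 column for w_0
X = np.hstack([np.ones((100, 1)),
               np.vstack([rng.normal( 3, 1, size=(50, 2)),
                          rng.normal(-3, 1, size=(50, 2))])])
y = np.hstack([np.ones(50), np.zeros(50)])

w = np.zeros(3)
eta = 0.1
for epoch in range(20):
    for i in rng.permutation(len(y)):             # visit points in random order
        g_i = X[i] * (y[i] - expit(X[i].dot(w)))  # gradient at a single point
        w += eta * g_i                            # noisy but cheap ascent step

acc = np.mean((expit(X.dot(w)) > 0.5) == y)
```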
To reduce the noise in the gradients at individual datapoints, one approach is to:
Algorithm:
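The same loop with the per-point gradient replaced by an average over a small batch reduces the noise of the updates. A sketch (batch size and other settings are illustrative):

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(2)
# Hypothetical two-blob data, with a leading 1 column for w_0
X = np.hstack([np.ones((100, 1)),
               np.vstack([rng.normal( 3, 1, size=(50, 2)),
                          rng.normal(-3, 1, size=(50, 2))])])
y = np.hstack([np.ones(50), np.zeros(50)])

w = np.zeros(3)
eta, batch = 0.1, 10
for epoch in range(20):
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch):
        b = order[start:start + batch]
        # gradient averaged over the mini-batch: less noisy than one point
        g = X[b].T.dot(y[b] - expit(X[b].dot(w))) / len(b)
        w += eta * g

acc = np.mean((expit(X.dot(w)) > 0.5) == y)
```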
Ravines (where the surface curves steeply in one dimension) are problematic because SGD can oscillate.
$G_t$ is a diagonal matrix whose $(i,i)$ entry is the sum of the squares of the past gradients with respect to parameter $i$, up to time $t$.
When using Adagrad, we don't need to tune the learning rate. However, as training goes on, the updates become smaller and smaller and the algorithm can stagnate. This is solved by the algorithms on the next slide.
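A minimal Adagrad sketch on the same logistic loss (here minimizing the negative mean log-likelihood; the diagonal of $G_t$ is kept as a vector, and the data and settings are hypothetical):

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(3)
# Hypothetical two-blob data, with a leading 1 column for w_0
X = np.hstack([np.ones((100, 1)),
               np.vstack([rng.normal( 3, 1, size=(50, 2)),
                          rng.normal(-3, 1, size=(50, 2))])])
y = np.hstack([np.ones(50), np.zeros(50)])

w = np.zeros(3)
G = np.zeros(3)        # diagonal of G_t: running sum of squared gradients
eta, eps = 1.0, 1e-8
for t in range(200):
    g = -X.T.dot(y - expit(X.dot(w))) / len(y)  # grad of negative mean log-likelihood
    G += g**2
    w -= eta * g / (np.sqrt(G) + eps)           # per-parameter step shrinks over time

acc = np.mean((expit(X.dot(w)) > 0.5) == y)
```

Note how the effective step size for each parameter decays as its accumulated squared gradient grows, which is exactly the stagnation issue mentioned above.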
Decision boundary: $$ w_0 + \sum_{j=1}^d w_jx_j^{(i)} = 0 $$ What about: $$ 10w_0 + \sum_{j=1}^d 10w_jx_j^{(i)} = 0 $$ or $$ 10000w_0 + \sum_{j=1}^d 10000w_jx_j^{(i)} = 0 $$
Same boundary! Which one has higher log likelihood?
$$l({\bf w}) \equiv \ln \prod_i P(Y = y^{(i)}|X = {\bf x}^{(i)},{\bf w}) = \sum_i \left [ y^{(i)} \left( {\bf w}^\top{\bf x}^{(i)} \right) - \ln \left( 1 + \exp \left( {\bf w}^\top{\bf x}^{(i)} \right) \right) \right] $$
plt.figure(figsize=(8,6))
from scipy.special import expit  # avoids overflow in np.exp for large |z|
z = np.linspace(-10,10,1000)
plt.plot(z, expit(z), label='a=1')
plt.plot(z, expit(10*z), label='a=10')
plt.plot(z, expit(100*z), label='a=100')
plt.legend(); plt.xlabel(r'$Z$',fontsize=16);
plt.ylabel(r'logistic$(a Z) = P(Y=1|X)$',fontsize=16);
${\bf w}\rightarrow \infty$ if the data is linearly separable
For MAP, need to define prior on $W$
A kind of Occam’s razor (simplest is best) prior
Helps avoid very large weights and overfitting
MAP estimation picks the parameter $W$ that has maximum posterior probability $P(W|Y,X)$ given the conditional likelihood $P(Y|W,X)$ and the prior $P(W)$.
Using Bayes rule again:
\begin{eqnarray} W^{MAP} &=& \underset{W}{\operatorname{argmax}} P(W|Y,X) = \underset{W} {\operatorname{argmax}} \frac{P(Y|W,X)P(W,X)}{P(Y,X)} \\ &=& \underset{W}{\operatorname{argmax}} P(Y|W,X)P(W,X) \\ &=& \underset{W}{\operatorname{argmax}} P(Y|W,X)P(W)P(X) ~~~~~ \text{ assume } P(W,X) = P(W)P(X) \\ &=& \underset{W}{\operatorname{argmax}} P(Y|W,X)P(W)\\ &=& \underset{W}{\operatorname{argmax}} \ln P(Y|W,X) + \ln P(W) \end{eqnarray}Zero-mean Gaussian prior on $W$: $$P(W_j=w_j) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\big(-\frac{w_j^2}{2\sigma^2} \big )$$
\begin{eqnarray} {{\bf w}}^{MAP} = \underset{ {\bf w}}{\operatorname{argmax}} \ln P(Y|W,X) - \frac{\sum_{j=0}^d w_j^2}{2\sigma^2} \end{eqnarray}This “pushes” parameters towards zero and corresponds to Regularization
lmbda = 10 # this is 1/(2*sigma**2)
def logposterior(w1, w2):
    w = np.array([[w1],[w2]]) # make w a column vector
    logpost = np.sum(Y*X.dot(w) - np.log(1 + np.exp(X.dot(w))))
    logpost += - (w1**2 + w2**2)*lmbda
    return logpost
L_w = np.vectorize(logposterior)(*np.meshgrid(w1x, w2x))
plt.figure(figsize=(8,6))
cs = plt.contourf(W1, W2, L_w,20,cmap='Blues_r',vmin=np.min(L_w),
vmax=0,alpha=0.6);
plt.colorbar()
contours = plt.contour(cs, colors='k')
plt.xlabel(r'$w_1$',fontsize=20)
plt.ylabel(r'$w_2$',fontsize=20)
plt.title('Log posterior in parameter space',fontsize=20);
def gradient_posterior(w1, w2, X, Y):
    w = np.array([[w1],[w2]])
    P_Y_1 = np.exp(X.dot(w))/(1 + np.exp(X.dot(w)))
    gw1 = X[:,0:1].T.dot(Y - P_Y_1) - 2*lmbda*w1
    gw2 = X[:,1:2].T.dot(Y - P_Y_1) - 2*lmbda*w2
    return gw1, gw2
plt.figure(figsize=(8,6))
cs = plt.contourf(W1, W2, L_w,20,cmap='RdBu_r',vmin=-np.max(np.abs(L_w)),
vmax=np.max(np.abs(L_w)),alpha=0.6); plt.colorbar()
contours = plt.contour(cs, colors='k')
plt.xlabel(r'$w_1$',fontsize=20);plt.ylabel(r'$w_2$',fontsize=20)
plt.title('Log posterior in parameter space',fontsize=20);
w1 = 7; w2 = -16
gw1, gw2 = gradient_likelihood(w1,w2,X, Y)
draw_vector_2D(plt, w1,w2,gw1/10,gw2/10, r'$\nabla ~ log~ likelihood(w)/10$','r');
gw1, gw2 = gradient_posterior(w1,w2,X, Y)
draw_vector_2D(plt, w1,w2,gw1/10,gw2/10, r'$\nabla ~ log~ posterior(w)/10$','b');
Consider these two assumptions:
Consider three learning methods:
How do these methods perform if we have plenty of data and:
Both (1) and (2) are satisfied.
Recall the decision boundary of GNB when the variances are exactly equal:
\begin{eqnarray} \ln \frac{P(Y=1|X_1...X_d)}{P(Y=0|X_1...X_d)} &=& C + G(X)\\ G(X) &=& -\frac{1}{2} \sum_j \big( x_{j} ^2(\frac{1}{\sigma_{j1}^2} - \frac{1}{\sigma_{j0}^2} ) - 2 x_{j} (\frac{\mu_{j1}}{\sigma_{j1}^2} - \frac{\mu_{j0}}{\sigma_{j0}^2} ) + (\frac{\mu_{j1}^2}{\sigma_{j1}^2} - \frac{\mu_{j0}^2}{\sigma_{j0}^2} ) \big ) \\ G(X) &=& \sum_j \big( x_{j} \frac{\mu_{j1}-\mu_{j0}}{\sigma_{j}^2} \big ) - \sum_j \big( \frac{\mu_{j1}^2 - \mu_{j0}^2}{2\sigma_{j}^2} \big ) \end{eqnarray}The decision boundary is linear, of the form: $\beta_0 + \sum_j \beta_j x_j = 0$.
The parameters are determined using the distance between centers, weighted by variance on each dimension.
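A quick check of this claim (the class means, shared per-feature variances, and test point are hypothetical, and equal class priors are assumed so the prior term drops out of the log-odds):

```python
import numpy as np

# Hypothetical class-conditional parameters (variances shared across classes)
mu1 = np.array([ 2.0, 1.0])    # per-feature means for class 1
mu0 = np.array([-2.0, 0.0])    # per-feature means for class 0
sigma2 = np.array([1.0, 4.0])  # shared per-feature variances sigma_j^2

beta = (mu1 - mu0) / sigma2                        # slopes from the derivation above
beta0 = -np.sum((mu1**2 - mu0**2) / (2 * sigma2))  # intercept (equal class priors)

def gnb_log_odds(x):
    """ln P(Y=1|x)/P(Y=0|x) via the Gaussian log-densities directly
    (normalizing constants cancel because variances are shared)."""
    ll1 = -np.sum((x - mu1)**2 / (2 * sigma2))
    ll0 = -np.sum((x - mu0)**2 / (2 * sigma2))
    return ll1 - ll0

x = np.array([0.5, -1.0])
# The linear form beta0 + beta.x matches the direct log-odds computation
print(np.isclose(beta0 + beta.dot(x), gnb_log_odds(x)))  # True
```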
Consider these two assumptions:
Consider three learning methods:
How do these methods perform if we have plenty of data and:
(2) is satisfied, but not (1).
Why doesn't GNB2 learn the same boundary as LR?
The decision boundary for GNB 2 is linear, of the form: $\beta_0 + \sum_j \beta_j x_j = 0$.
But each parameter is tied to the individual means and variances of each dimension $x_j$, e.g.:
$$ \beta_j = \frac{\mu_{j1}-\mu_{j0}}{\sigma_{j}^2} $$GNB2 is therefore less flexible than LR.
Consider these two assumptions:
Consider three learning methods:
How do these methods perform if we have plenty of data and:
Neither (1) nor (2) is satisfied.
Which method works better if we have infinite training data, and...
The bottom line:
GNB2 and LR both use linear decision surfaces; GNB need not.
Given infinite data, LR is better than or equal to GNB2, because its training procedure does not make assumptions 1 or 2 (though our derivation of the form of $P(Y|X)$ did).
GNB2 converges more quickly to its perhaps-less-accurate asymptotic error (more bias than LR).
GNB is both more biased than LR (assumption 1) and less biased (no linearity assumption), so either might outperform the other.
LR is a linear classifier: decision rule is a hyperplane